Belgian patent BE1024766B1: Method for typing nucleic acid or amino acid sequences based on sequence analysis

Abstract:
The invention relates to a method for determining the presence or absence of predefined alleles in a set of reading sequences, comprising: a) defining a k-mer (a nucleic acid sequence of length 'k') and a k-mer space of all permutations of nucleic acids (4^k) of length 'k'; b) determining for each predefined allele which k-mers are present in said predefined allele, thereby obtaining an allele-associated k-mer set; c) providing a set of reading sequences; d) determining, for each predefined allele and allele-associated k-mer set, the number of occurrences of each k-mer in said set of reading sequences; e) filtering the number of occurrences by resetting it to 0 if: i) the total number of occurrences, ii) the total number of occurrences in the forward direction, or iii) the total number of occurrences in the reverse direction is below a predefined threshold value; f) determining the presence or absence of each predefined allele in the reading sequences based on the filtered number of occurrences obtained in e).
Publication number: BE1024766B1
Application number: E2016/5082
Filing date: 2016-02-02
Publication date: 2018-06-25
Inventors: Hannes Pouseele; Koen Janssens
Applicant: Applied Maths NV
Main IPC class:

Description:

(30) Priority data: 02/02/2015 EP 15153406.2
(73) Holder(s): APPLIED Maths NV, 9830 SINT-MARTENS-LATEM, Belgium
(72) Inventor(s): POUSEELE Hannes, 8400 Ostend, Belgium; JANSSENS Koen, 9040 GHENT, Belgium
(54) Method for typing nucleic acid or amino acid sequences based on sequence analysis
(57) The invention relates to a method for determining the presence or absence of predefined alleles in a set of reading sequences, comprising: a) defining a k-mer (a nucleic acid sequence of length 'k') and a k-mer space of all permutations of nucleic acids (4^k) of length 'k'; b) determining for each predefined allele which k-mers are present in said predefined allele, thereby obtaining an allele-associated k-mer set; c) providing a set of reading sequences; d) determining, for each predefined allele and allele-associated k-mer set, the number of occurrences of each k-mer in said set of reading sequences; e) filtering the number of occurrences by resetting it to 0 if: i) the total number of occurrences, ii) the total number of occurrences in the forward direction, or iii) the total number of occurrences in the reverse direction is below a predefined threshold; f) determining the presence or absence of each predefined allele in the reading sequences based on the filtered number of occurrences obtained in e).
Figure 1
BELGIAN INVENTION PATENT
FPS Economy, SMEs, Self-employed & Energy
Intellectual Property Office
Publication number: 1024766
Filing number: BE2016/5082
International Classification: G06F 19/22, C12Q 1/68
Date of Issue: 25/06/2018
The Minister of Economy,
Having regard to the Paris Convention of 20 March 1883 for the Protection of Industrial Property;
Having regard to the Law of March 28, 1984 on inventive patents, Article 22, for patent applications filed before September 22, 2014;
Having regard to Title 1 Invention Patents of Book XI of the Economic Law Code, Article XI.24, for patent applications filed from September 22, 2014;
Having regard to the Royal Decree of 2 December 1986 on the filing, granting and maintenance of inventive patents, Article 28;
Having regard to the application for an invention patent received by the Intellectual Property Office on 02/02/2016.
Whereas for patent applications that fall within the scope of Title 1, Book XI, of the Code of Economic Law (hereinafter WER), in accordance with Article XI.19, § 4, second paragraph, of the WER, the granted patent will be limited to the patent claims for which the novelty search report was prepared, when the patent application is the subject of a novelty search report indicating a lack of unity of invention as referred to in paragraph 1, and when the applicant does not limit his filing and does not file a divisional application in accordance with the search report.
Decision:
Article 1
APPLIED Maths NV, Keistraat 120, 9830 SINT-MARTENS-LATEM Belgium;
represented by
DUYVER Jurgen, Holidaystraat 5, 1831, DIEGEM;
CAERS Raf, Holidaystraat 5, 1831, DIEGEM;
VAN REET Joseph, Holidaystraat 5, 1831, DIEGEM;
GEVERS PATENTS, Holidaystraat 5, 1831, DIEGEM;
a Belgian invention patent with a term of 20 years, subject to payment of the annual fees as referred to in Article XI.48, § 1 of the Code of Economic Law, for: Method for typing nucleic acid or amino acid sequences on the basis of sequence analysis.
INVENTOR (S):
POUSEELE Hannes, Independence Street 10, 8400, Ostend;
JANSSENS Koen, Antwerpsesteenweg 344, 9040, GHENT;
PRIORITY:
02/02/2015 EP 15153406.2;
DIVISION:
Divided from parent application: Filing date of the parent application:
Article 2. - This patent is granted without prior examination of the patentability of the invention, without warranty of the merit of the invention or of the accuracy of its description, and at the risk of the applicant(s).
Brussels, 25/06/2018,
With special authorization:
BE2016/5082
METHOD FOR TYPING NUCLEIC ACID OR AMINO ACID SEQUENCES BASED ON SEQUENCE ANALYSIS
Technical field of the invention
The technical fields of application of the present invention are bioinformatics, genetics, microbiology, epidemiology and evolutionary research.
Technical background of the invention
The present invention is situated in the field of molecular bacterial typing and subtyping. The purpose of any (sub)typing method is to identify bacteria more precisely than at the level of the species (or subspecies), and to group individual isolates in a meaningful way. Ideally, a typing method should have sufficient typability (the ability to type isolates unambiguously), reproducibility and portability (the ability to perform the method in a reproducible and fully compatible manner in different laboratories at different times), should be relatively easy to use, and must have sufficient discriminatory power [1].
The ability to do this quickly and reliably is the cornerstone of all laboratory surveillance [21]. Isolates with indistinguishable subtypes are more likely to come from a common source than those with different subtypes. This concept forms the basis for applying molecular subtyping to bacterial pathogens for surveillance, outbreak detection and outbreak response. In order to be considered suitable for laboratory surveillance and outbreak detection, a subtyping method should be evaluated against several important performance criteria [21]: typability, reproducibility, discriminatory power and epidemiological concordance. These criteria must be evaluated using an epidemiologically relevant range of isolates from a region as geographically diverse as the one where the method will be applied.
Additional criteria for evaluating the workability of the method include speed, throughput, cost, ease of use, objectivity, versatility and portability. The importance of these criteria is even greater for the successful application of a subtyping method in surveillance across different laboratories [22]. Especially in the new domain of subtyping based on short-read sequence data, there is a need for efficient and reliable analysis strategies that reflect the raw data accurately and faithfully.
The present invention addresses at least one of the above-mentioned shortcomings or satisfies at least one of the above-mentioned requirements.
Molecular typing methods can be divided into two groups: phenotypic and genotypic methods. In the 1980s, only phenotypic techniques were available. Phenotypic tests had low reproducibility, low typability, and insufficient discriminatory power to be used in epidemiological studies. In this epidemiological context, genotypic techniques with better typability and discriminatory power replaced phenotypic methods in the 1990s [1]. Genotypic methods are classified into i) band-based methods, which involve an analysis of DNA banding patterns by gel electrophoresis and which are also referred to as gel-based typing methods, and ii) sequence-based methods, which use the analysis of DNA sequences.
The most commonly used band-based methods were restriction endonuclease analysis (REA), pulsed field gel electrophoresis (PFGE), capillary or conventional PCR ribotyping, multi-locus enzyme electrophoresis (MLEE), and multi-locus variable-number tandem repeat analysis (MLVA), while the most commonly used sequence-based genotyping method was multi-locus sequence typing (MLST, or, as a variant, single locus sequence typing).
In the early 1990s, MLEE was the only appropriate method for studying the global or long-term spread of bacterial strains. This method identifies variants of the gene products of 10-20 "housekeeping genes" (genes encoding basic metabolic functions), using electrophoresis of cell extracts on starch gels, followed by detection using specific enzyme stains. In most bacterial populations, a number of variants with different charges are present for each enzyme, reflecting small differences in the amino acid sequences of the proteins, and therefore also in their corresponding gene sequences, and these variants are considered to be different alleles. Isolates with the same alleles at every housekeeping locus are believed to be very closely related, and are assigned to the same clone (strain).
However, MLEE, like other band-based methods, has major shortcomings, in particular the fact that results from one laboratory are very difficult to compare with those from another. The next step was to convert the method into a procedure based on DNA sequences, so that the different DNA sequences at each housekeeping locus of a pathogen were distinguished directly as different alleles, instead of distinguishing alleles indirectly on the basis of differences in the electrophoretic mobility of their gene products on starch gels. This simple modification brings enormous benefits because sequence data are unambiguous and easy to compare between laboratories, and the alleles at each locus, and the allele profiles and isolate information for each pathogen, can be stored in online databases that can be consulted over the Internet. Fewer loci need to be used than with MLEE, since sequencing identifies more alleles per locus, and the approach allows for a simple nomenclature because each different allele profile can be classified as a different sequence type (ST), providing a useful shorthand description [2].
Multi-locus sequence typing (MLST) was introduced in 1998 [23] and has proved to be an extremely successful approach for molecular typing. MLST schemes were rapidly developed for most of the major bacterial pathogens, with seven housekeeping loci being standard. This sequence-based typing method relies on sequencing DNA fragments in the range of 300 to 500 bp that represent seven housekeeping genes (MLST 7HG). Each sequence variant of a housekeeping gene is assigned a separate allele number, and the combination of seven allele numbers (the allele profile) yields a sequence type (ST). MLST produces high-throughput sequence data that can be uploaded from laboratories around the world to a common database on the web [4]. This makes it easier to identify sequence types and to study the population structure and global epidemiology of bacteria [3].
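The mapping from an allele profile to a sequence type described above can be pictured as a simple table lookup. The following is a minimal sketch only: the locus names, allele numbers and ST numbers are invented for illustration and do not come from any real MLST scheme or database.

```python
# Hypothetical seven-locus scheme; locus names and all numbers are invented.
LOCI = ("locusA", "locusB", "locusC", "locusD", "locusE", "locusF", "locusG")

# Hypothetical ST table: allele profile (one allele number per locus) -> ST.
ST_TABLE = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 1, 1, 1): 2,
}


def sequence_type(profile):
    """Return the ST for an allele profile, or None if the profile is unknown.

    profile: dict mapping each locus name in LOCI to its allele number.
    """
    key = tuple(profile[locus] for locus in LOCI)
    return ST_TABLE.get(key)
```

A profile that is absent from the table returns None, which corresponds to a combination of alleles that has not yet been assigned a sequence type.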
In addition to the unambiguousness of MLST and the ease of exchanging and comparing allele profiles consisting of seven numbers, MLST offers another advantage. The allele profiles of isolates can be obtained from clinical material by direct PCR amplification of the seven housekeeping loci from CSF or blood. In this way, isolates can be precisely characterized even when they cannot be cultured from clinical material. A practical disadvantage of MLST is the relatively high cost of sequencing multiple targets. In addition, the discriminatory power of current MLST schemes examining seven housekeeping loci is suboptimal for several epidemiological questions that require a more micro-epidemiological analysis [5-9]. MLST further requires expertise in bioinformatics and genetics to interpret data correctly [10].
Recent major advances in DNA sequencing technology, with bacterial genomes sequenced at high throughput, make it possible to sequence the genomes of many thousands of isolates of a pathogen species. Massively parallel sequencing produces very large numbers (millions) of short sequences, for which the term "reads" is used, ranging from about 50 to several hundred or several thousand nucleotides in length.
These short sequences must be assembled before most genome analyses can be started. Genome analysis techniques have traditionally focused on assembly-based approaches [11], where the raw data are either aligned with a closely related reference genome [12-13], or assembled de novo and placed on a scaffold using various algorithms [14-16]. The traditional process of aligning these reads with a reference genome is time consuming, and current alignment tools make several compromises between the accuracy and the speed of mapping. Furthermore, it is computationally difficult to assemble short reads without a reference. Genome assembly remains a very difficult problem, made even more burdensome by shorter reads and unreliable long-distance pairing information.
Despite the technical difficulties associated with the use of short-read sequencing technologies, the availability of sequence information from not just a small portion of the genome but the entire genome presents enormous potential for bacterial strain typing. In particular, the emergence of high-throughput sequencing technologies puts an end to the limiting choice of using 7 genes to characterize bacterial strains, a number that was largely motivated by cost and effort considerations, and makes implementing MLST on the scale of a complete genome a viable strategy. In this context, a locus is defined as a set of nucleic acid or amino acid sequences that are in some way closely related, for example on the basis of sequence similarities, or on a biological or functional basis, i.e., phenotypic similarities. The nucleic acid or amino acid sequences in a locus are called alleles or variants of the locus.
A number of approaches have been developed to perform MLST, whole genome MLST, or similar analyses based on short reads.
Certain assembly-free approaches to genome analysis rely on local read mapping, comparing each of the reads to a partial reference genome, thereby obtaining one or more alignments between each read and the reference genome. A technique known in the art is short read sequence typing (SRST), which uses the local mapping of reads, and in which all calculations, such as that of the allele assignment score, are performed on the basis of this mapping [17]. SRST uses the software packages Burrows-Wheeler Aligner (BWA) and SamTools (for the Sequence Alignment/Map format). The SRST algorithm is based on a local assembly and does not use a k-mer profile approach for allele detection and loci prediction.
Another assembly-free technique, SpolPred, is aimed at performing spoligotyping in silico. Spoligotyping is a typing technique for bacteria of the Mycobacterium tuberculosis complex (MTBC), the best known of which is Mycobacterium tuberculosis, which causes tuberculosis. Spoligotyping is based on the presence or absence of a set of 43 specific nucleotide sequences (called spacers) in the entire genome of an MTBC sample; only the presence or absence of the 43 loci is important, not the exact allelic variant. Since tuberculosis strains vary in the occurrence of this set of spacers, each strain yields a specific spot pattern (the term 'spot' derives from hybridization assays, the classic way of determining the presence or absence of the spacers). This spot pattern is then translated into a 15-digit numeric code (called the "octal code") for each strain. The online database SITVITWEB contains 2,740 shared types, also known as 'spoligotype international types' (SITs), which have been found among 58,180 clinical isolates, grouped into 62 (sub)lineages that are useful for studying the geographical distribution of MTBC (sub)lineages. SpolPred tests the unique 25 bp sequence of each spacer against each read, tolerating up to one mismatch. The presence or absence of all expected spacers is finally translated into an octal code, which is then linked to an appropriate spoligotype in the SITVITWEB database [18]. Unfortunately, SpolPred does not predict the presence of unknown variants.
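The translation of a 43-spacer presence/absence pattern into a 15-digit octal code can be sketched as follows. The text above does not spell out the conversion, so this sketch assumes the commonly used convention: spacers 1 to 42 are read as 14 groups of three binary digits, each converted to one octal digit, and spacer 43 is appended as a final 0 or 1. The function name is hypothetical.

```python
def spoligotype_octal(spacers):
    """Convert a 43-character string of '0'/'1' (spacer absent/present)
    into a 15-digit octal code, under the assumed convention described above.
    """
    if len(spacers) != 43 or set(spacers) - {"0", "1"}:
        raise ValueError("expected a string of exactly 43 characters, each '0' or '1'")
    # Spacers 1-42: 14 groups of three bits, each read as one octal digit.
    triplets = [spacers[i:i + 3] for i in range(0, 42, 3)]
    digits = "".join(str(int(triplet, 2)) for triplet in triplets)
    # Spacer 43 is kept as a single trailing binary digit.
    return digits + spacers[42]
```

For example, a pattern in which all 43 spacers are present yields 777777777777771 under this convention.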
The BIGSDb software package uses de novo assembly (in particular, the Velvet Optimizer algorithm [39]) to assemble the short reads into a collection of contigs, and then uses BLAST [40] to identify the allele variants present in the sample [19].
It should be clear from the above description that the concept of building a unique k-mer profile for a locus, bringing together the characteristic features of a number of alleles (or variants), is, as far as the inventors can tell, still lacking in the state of the art.
Summary of the invention
An object of the present invention is to provide a method for (sub) typing nucleic acid and amino acid sequences, wherein this method is not limited to a particular microorganism.
An object of the present invention is to provide a method for (sub) typing nucleic acid and amino acid sequences, wherein this method can be applied to sequences comprising unnatural, modified or analogous nucleic acids or amino acids.
An object of the present invention is to provide a method for (sub) typing nucleic acid and amino acid sequences that are not assembled. In other words: a method for (sub) typing short reads.
An object of the present invention is to provide a method for (sub) typing nucleic acid and amino acid sequences that have been assembled or partially assembled.
An object of the present invention is to provide a method for (sub) typing nucleic acid and amino acid sequences that are not assembled, and that come from samples that have not been cultured. In other words: a method for (sub) typing short reads from samples that have not been cultured.
An object of the present invention is to provide a method for (sub) typing nucleic acid and amino acid sequences that have been assembled or partially assembled, and that come from samples that have not been cultured.
An object of the present invention is to provide a method for detecting the presence of known alleles of a locus in nucleic acid and amino acid sequences that are not assembled. In other words, a method for detecting the presence of known alleles of a locus in short reads.
An object of the present invention is to provide a method for detecting the presence of known alleles of a locus in nucleic acid and amino acid sequences that have been assembled or partially assembled.
An object of the present invention is to provide a method for detecting the presence of unknown alleles of a locus in nucleic acid and amino acid sequences that are not assembled. In other words, a method for detecting the presence of unknown alleles of a locus in short reads.
An object of the present invention is to provide a method for detecting the presence of unknown alleles of a locus in nucleic acid and amino acid sequences that have been assembled or partially assembled.
An object of the present invention is to provide a (sub) typing method that is not limited to the housekeeping genes, that is, it can be used to analyze any one or more alleles of any locus.
An object of the present invention is to provide a method for (sub) typing nucleic acid and amino acid sequences for more than one locus at the same time.
An object of the present invention is to provide a method for (sub) typing with a higher degree of at least one of the crucial performance criteria, selected from reproducibility, typability, objectivity, portability, discriminative power and epidemiological concordance.
The present invention relates to a method for determining the presence or absence of one or more predetermined nucleic acid sequences (referred to as "alleles") in a set of nucleic acid sequences (referred to as "reading sequences"), said method comprising:
a) defining a k-mer, which is a nucleic acid sequence of length 'k' (for any natural number k), and a k-mer space, which consists of all permutations of nucleic acids (4^k) of the chosen length 'k', wherein nucleic acid sequences that are each other's reverse complement (i.e., that are identical except for their direction, that direction being either forward (5'->3') or reverse (3'->5')) are considered equivalent within the k-mer space;
b) for each of the one or more predetermined alleles, determining which k-mers, as defined in step a), are present in said allele, and optionally determining the number of times each k-mer, as defined in step a), occurs in said allele, whereby a corresponding allele-associated k-mer set is obtained;
c) providing a collection of reading sequences;
d) for each of the one or more predetermined alleles and the corresponding allele-associated k-mer sets, determining the number of occurrences of each k-mer (from the allele-associated k-mer set) in said set of reading sequences, wherein k-mers that are equivalent in the forward (5'->3') and reverse (3'->5') directions are defined as equivalent k-mers within the k-mer set, but can still be distinguished from each other if desired; and wherein an occurrence of a k-mer can be discarded based on appropriate quality scores for individual nucleic acids in the reading sequence;
e) filtering the number of occurrences of each k-mer (from the allele-associated k-mer set) determined in step d), by resetting this number of occurrences to 0 if: i) the total number of occurrences (in either the forward (5'->3') or the reverse (3'->5') direction) is below a predetermined threshold value, ii) the total number of occurrences in the forward (5'->3') direction is below a predetermined threshold value, or iii) the total number of occurrences in the reverse (3'->5') direction is below a predetermined threshold value;
f) determining the presence or absence of each predetermined allele in the reading sequences based on the filtered number of occurrences of each k-mer (from the allele-associated k-mer set) obtained in step e).
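By way of illustration only, steps a) to f) above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation of the invention: all function names and default threshold values are hypothetical, quality scores are not modelled, and the decision rule in step f) (an allele is called present when a chosen fraction of its k-mers keeps a non-zero count) is an assumption, since the text above does not fix a particular rule.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Yield every window of length k in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def allele_kmer_set(allele, k):
    """Step b): the k-mers present in one allele, in the allele's own direction."""
    return set(kmers(allele, k))


def count_occurrences(allele_kmers, reads, k):
    """Step d): count each allele k-mer in the reads, separately per direction."""
    forward, reverse = Counter(), Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in allele_kmers:
                forward[km] += 1
            rc = reverse_complement(km)
            # A k-mer equal to its own reverse complement is counted as forward only.
            if rc != km and rc in allele_kmers:
                reverse[rc] += 1
    return forward, reverse


def filter_counts(allele_kmers, forward, reverse, min_total, min_forward, min_reverse):
    """Step e): reset a count to 0 if any of the three thresholds is not met."""
    filtered = {}
    for km in allele_kmers:
        f, r = forward[km], reverse[km]
        keep = (f + r >= min_total) and (f >= min_forward) and (r >= min_reverse)
        filtered[km] = f + r if keep else 0
    return filtered


def allele_present(filtered, min_fraction=1.0):
    """Step f), ASSUMED rule: present if enough k-mers keep a non-zero count."""
    if not filtered:
        return False
    supported = sum(1 for count in filtered.values() if count > 0)
    return supported / len(filtered) >= min_fraction


def call_allele(allele, reads, k, min_total=3, min_forward=1, min_reverse=1):
    """Run steps b) to f) for one allele; thresholds are illustrative defaults."""
    allele_kmers = allele_kmer_set(allele, k)
    forward, reverse = count_occurrences(allele_kmers, reads, k)
    filtered = filter_counts(allele_kmers, forward, reverse,
                             min_total, min_forward, min_reverse)
    return allele_present(filtered), filtered
```

In this sketch the equivalence of reverse complements from step a) is handled by crediting a reverse-complement occurrence in a read to the corresponding allele k-mer while keeping the two directions in separate counters, so that the per-direction thresholds of step e) can still be applied.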
The present invention further relates to a method for determining the presence or absence of one or more sets (referred to as 'loci') containing one or more predetermined nucleic acid sequences (referred to as 'alleles') in a set of nucleic acid sequences (referred to as 'reading sequences'), wherein said method comprises:
a) defining a k-mer, which is a nucleic acid sequence of length 'k' (for any natural number k), and a k-mer space, which consists of all permutations of nucleic acids (4^k) of the chosen length 'k', wherein nucleic acid sequences that are each other's reverse complement (i.e., that are identical except for their direction, that direction being either forward (5'->3') or reverse (3'->5')) are considered equivalent within the k-mer space;
b) for each of the one or more predetermined alleles, determining which k-mers, as defined in step a), are present in said allele, and optionally determining the number of times each k-mer, as defined in step a), occurs in said allele, whereby a corresponding allele-associated k-mer set is obtained;
c) for each locus, determining which one or more allele-associated k-mer sets are present in the locus, and optionally determining the number of occurrences of each k-mer, as defined in step a);
d) providing a collection of reading sequences;
e) for each locus and the associated allele-associated k-mer sets, determining the presence of said one or more allele-associated k-mer sets and the number of times each k-mer (of each of the allele-associated k-mer sets) occurs in said set of reading sequences, wherein k-mers that are equivalent in the forward (5'->3') and reverse (3'->5') directions are defined as equivalent k-mers within the k-mer set, but can still be distinguished from each other if desired; and wherein an occurrence of a k-mer can be discarded based on appropriate quality scores for individual nucleic acids in the reading sequence;
f) filtering the number of occurrences of each k-mer (of each of the allele-associated k-mer sets) determined in step e), by resetting this number of occurrences to 0 if: i) the total number of occurrences (in either the forward (5'->3') or the reverse (3'->5') direction) is below a predetermined threshold, ii) the total number of occurrences in the forward (5'->3') direction is below a predetermined threshold, or iii) the total number of occurrences in the reverse (3'->5') direction is below a predetermined threshold;
g) determining the presence or absence of a locus containing one or more predetermined alleles in the reading sequences based on the filtered number of occurrences obtained for each k-mer (of each of the allele-associated k-mer sets).
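The locus-level decision in step g) can be illustrated with a small sketch that starts from already filtered k-mer counts. The decision rule used here is an assumption made for illustration: a locus is called present when at least one of its alleles has a sufficient fraction of its k-mers supported, and an allele is reported as an exact match when all of its k-mers are supported. The function names and the default fraction are hypothetical.

```python
def kmer_set(seq, k):
    """Return the set of k-mers of a sequence (empty if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def locus_call(locus_alleles, filtered_counts, k, min_fraction=0.8):
    """Call one locus from filtered k-mer counts (ASSUMED decision rule).

    locus_alleles: dict mapping allele identifier -> allele sequence.
    filtered_counts: dict mapping k-mer -> occurrence count after filtering.
    Returns (locus_present, exactly_matching_alleles, support_per_allele).
    """
    support = {}
    for allele_id, seq in locus_alleles.items():
        ks = kmer_set(seq, k)
        if not ks:
            continue  # allele shorter than k: no k-mers to evaluate
        found = sum(1 for km in ks if filtered_counts.get(km, 0) > 0)
        support[allele_id] = found / len(ks)
    exact = sorted(a for a, s in support.items() if s == 1.0)
    present = any(s >= min_fraction for s in support.values())
    return present, exact, support
```

Under this rule, a locus that is called present without any exactly matching allele would be a candidate for an allele that is not among the predetermined ones.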
The present invention further relates to a method for determining the presence or absence of one or more predetermined amino acid sequences (referred to as "alleles") in a set of amino acid sequences (referred to as "reading sequences"), said method comprising:
a) defining a k-mer, which is an amino acid sequence of length 'k' (for any natural number k), and a k-mer space, which consists of all permutations of amino acids (20^k to 22^k) of the chosen length 'k';
b) for each of the one or more predetermined alleles, determining which k-mers, as defined in step a), are present in said allele, and optionally determining the number of times each k-mer, as defined in step a), occurs in said allele, whereby a corresponding allele-associated k-mer set is obtained;
c) providing a collection of reading sequences;
d) for each of the one or more predetermined alleles and the corresponding allele-associated k-mer sets, determining the number of occurrences of each k-mer (of the allele-associated k-mer set) in said set of reading sequences, wherein an occurrence of a k-mer can be discarded based on appropriate quality scores for individual amino acids in the reading sequence;
e) filtering the number of occurrences of each k-mer (of the allele-associated k-mer set) determined in step d), by resetting this number of occurrences to 0 if the total number of occurrences is below a predetermined threshold value;
f) determining the presence or absence of each predetermined allele in the reading sequences based on the filtered number of occurrences of each k-mer (from the allele-associated k-mer set) obtained in step e).
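For amino acid sequences the same scheme applies without any notion of reverse complement, so a single total count per k-mer is filtered. The following is a minimal sketch under assumptions: the function names are hypothetical, quality scores are not modelled, and the final rule (an allele is called present only if every one of its k-mers keeps a non-zero count) is an illustrative choice that the text above does not prescribe.

```python
from collections import Counter


def kmers(seq, k):
    """Return all windows of length k in an amino acid string."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def call_protein_allele(allele, reads, k, min_total):
    """Count, filter and call one amino acid allele (ASSUMED decision rule)."""
    allele_kmers = set(kmers(allele, k))
    # Count only k-mers that belong to the allele-associated k-mer set.
    counts = Counter(km for read in reads for km in kmers(read, k)
                     if km in allele_kmers)
    # Reset counts below the threshold to 0.
    filtered = {km: (counts[km] if counts[km] >= min_total else 0)
                for km in allele_kmers}
    present = bool(filtered) and all(count > 0 for count in filtered.values())
    return present, filtered
```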
The present invention further relates to a method for determining the presence or absence of one or more sets (referred to as 'loci') containing one or more predetermined amino acid sequences (referred to as 'alleles') in a set of amino acid sequences (referred to as 'reading sequences'), wherein said method comprises:
a) defining a k-mer, which is an amino acid sequence of length 'k' (for any natural number k), and a k-mer space, which consists of all permutations of amino acids (20^k to 22^k) of the chosen length 'k';
b) for each of the one or more predetermined alleles, determining which k-mers, as defined in step a), are present in said allele, and optionally determining the number of times each k-mer, as defined in step a), occurs in said allele, whereby a corresponding allele-associated k-mer set is obtained;
c) for each locus, determining which one or more allele-associated k-mer sets are present in the locus, and optionally determining the number of occurrences of each k-mer, as defined in step a);
d) providing a collection of reading sequences;
e) for each locus and the associated allele-associated k-mer sets, determining the presence of said one or more allele-associated k-mer sets and the number of times each k-mer (of each of the allele-associated k-mer sets) occurs in said set of reading sequences, wherein an occurrence of a k-mer can be discarded based on appropriate quality scores for individual amino acids in the reading sequence;
f) filtering the number of occurrences of each k-mer (of each of the allele-associated k-mer sets) determined in step e), by resetting this number of occurrences to 0 if the total number of occurrences (in either the forward (5'->3') or the reverse (3'->5') direction) is below a predetermined threshold;
g) determining the presence or absence of a locus containing one or more predetermined alleles in the reading sequences based on the filtered number of occurrences obtained for each k-mer (of each of the allele-associated k-mer sets).
The present invention further relates to a system for determining the presence or absence of one or more loci containing one or more predetermined nucleic acid or amino acid sequences or variants thereof, or for determining the presence or absence of one or more predetermined nucleic acid or amino acid sequences or variants thereof, in a collection of reading sequences, the system comprising at least one processor and an associated storage medium containing a program executable by said at least one processor, the system comprising software code portions that execute the steps defined in any of the embodiments set forth herein, in any logical order.
The present invention further relates to a non-volatile storage medium on which a computer program product is stored that includes software code portions in a format executable on a computer device, and configured to perform the steps defined in any of the embodiments set forth herein, in any logical order when they are executed on said computer device.
The present invention further relates to a computer program product that is executable on a computer device and which includes software code for performing the method according to any of the embodiments set forth herein when it is run on said computer device.
Brief description of the Figures
Figure 1 shows a comparison of a portion of the allele sequences found by the algorithms in BIGSDb and that of the present invention for isolate OXC6347 at the locus CAMP1442. Allele 69 comes from the BIGSDb algorithm, and allele 1 comes from the algorithm of the present invention.
Figure 2 shows the reads present that cover position 82126.
Figure 3 shows the number of mismatched alleles as identified by the BIGSDb algorithm (shown in black) and by the algorithm of the present invention (shown in white).
Description of the invention
The present invention relates to a method, a system, a non-volatile storage medium, and a computer program product, as defined here and in the claims.
In one embodiment, in the method according to any of the embodiments set forth herein, in the absence of a predefined allele or of a locus containing one or more predefined alleles, the method further comprises determining the percentage of sequence identity of one or more of the k-mers compared to the reading sequences.
In one embodiment, in the method of any of the embodiments set forth herein, the set of reading sequences consists of unassembled, assembled, or partially assembled sequences.
In one embodiment, in the method of any of the embodiments set forth herein, the unassembled sequences are obtained from a sequencing platform selected from Sanger sequencing, pyrosequencing, sequencing sequences, and any other type of platform that sequences nucleic acids or amino acids.
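One possible reading of this embodiment is to report, for a k-mer that was not found exactly, the best percentage identity it reaches against any window of the same length in the reading sequences. The sketch below implements only that interpretation, which is an assumption: it compares position by position without gaps, and the function name is hypothetical.

```python
def best_identity(kmer, reads):
    """Return the highest percentage identity (0.0 to 100.0) between a k-mer
    and any equally long, ungapped window of any read.
    """
    k = len(kmer)
    if k == 0:
        raise ValueError("k-mer must not be empty")
    best = 0.0
    for read in reads:
        for i in range(len(read) - k + 1):
            window = read[i:i + k]
            matches = sum(1 for a, b in zip(kmer, window) if a == b)
            best = max(best, 100 * matches / k)
    return best
```

A value of 100.0 means that the k-mer occurs exactly in at least one read; reads shorter than the k-mer contribute nothing.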
In one embodiment, in the method of any of the embodiments set forth herein, the unassembled, assembled, or partially assembled sequences are obtained from any biological material, or from data in silico.
In one embodiment, in the method of any of the embodiments set forth herein, the biological material is selected from one or more Organisms and any portion thereof.
In one embodiment, in the method according to any of the embodiments set forth herein, the one or more organism are selected from prokaryotes, including bacteria and archaea, viruses, fungi, microscopic arthropods, microscopic crustaceans, any pathogen , chimeric or artificially created microorganism, and any mixture thereof.
In one embodiment, in the method of any of the embodiments set forth herein, k is selected from 11 to 71 nucleotides.
In one embodiment, in the method of any of the embodiments set forth herein, k is selected from 5 to 23 amino acids.
In one embodiment, in the method according to any of the embodiments set forth herein, the method is used for typing or subtyping; multi-locus sequence typing (MLST); extended multi-locus sequence typing (eMLST); ribosomal multi-locus sequence typing (rMLST); core genome multi-locus sequence typing (cgMLST); whole genome multi-locus sequence typing (wgMLST, MLST+); spoligotyping for Mycobacterium tuberculosis; detection of large sequence polymorphisms (LSP) for Mycobacterium tuberculosis; Taqman®-based SNP analysis; single locus sequence typing (or allele typing); antibiotic resistance typing; antigen serotyping; SPATE typing (serine protease autotransporters of Enterobacteriaceae); prediction of drug resistance in HIV, HBV, or HCV; typing of direct repeat units (DRU); typing of mycobacterial interspersed repetitive units (MIRU); typing of variable number tandem repeats (VNTR); or typing of clustered regularly interspaced short palindromic repeats (CRISPR).
In one embodiment, the sequence reads are from a sample.
In one embodiment, the sequence files may be in the FASTQ format. Both forward and reverse reads are supported.
The term "nucleic acid" refers to both deoxyribonucleic acid (DNA) and ribonucleic acid (RNA).
The term "alleles" refers to variant forms of the same gene, which occupy the same locus on homologous chromosomes and control the variants in the production of the same gene product.
The term 'locus' is defined as a collection of nucleic acid or amino acid sequences that are closely related genotypically (based on sequence identity) or phenotypically (based on observable characteristics or properties, such as morphology or development, biochemical, biological, functional or physiological properties, phenology, behavior, and products of behavior). The nucleic acid or amino acid sequences in a locus are called alleles or variants for that locus.
A "string" is a sequence of characters.
A 'k-mer' is a nucleic acid or amino acid sequence string of length k. The suffix '-mer' refers to the unit of the nucleotide or protein sequence. For a nucleotide sequence, that unit will be a base selected from A, C, T and G. Modified, analog or unnatural bases are also within the scope of the present invention, for example, but not limited to, I (hypoxanthine), X (xanthine), m⁷G (7-methylguanine), D (5,6-dihydrouracil), m⁵C (5-methylcytosine), 5-hydroxymethylcytosine, aminoallyl nucleotide, isoguanine, isocytosine, the fluorescent 2-amino-6-(2-thienyl)purine, pyrrole-2-carbaldehyde, and the like. For an amino acid sequence, the unit will be an amino acid selected from A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y and V. Modified, analog or unnatural amino acids are also within the scope of the present invention, for example, but without being limited to, U or O. In addition, the scope of the method of the invention includes nucleic acid codes (such as K, M, R, Y, S, W, B, V, H, D, X, N, T / U, X / N,, -) or amino acid codes (such as B, Z, J, X) for positions where the sequencing platform cannot determine the identity of a residue with certainty.
A k-mer is usually a short sequence of nucleic acids or amino acids of length k, or in computer terminology: a string of length k. For the non-skilled person this can also be understood as a word of length k. The strings or "words" used in a locus determine the characteristics of that locus, and to some extent make it possible to distinguish one locus from another. The k-mer is preferably long enough to be specific, and short enough to fit within a read. The choice of the length k of this short sequence should take sequencing errors into account. Since the k-mers of a read are matched exactly to the allele sequences, a k-mer found in a read must not contain a sequencing error. Accordingly, the length of the k-mer should be small enough to fit between sequencing errors. Assuming, for example, a read length of 100 bp and a sequencing error rate of 1%, a 50 bp subsequence containing no sequencing error can be expected to be found in each read. Therefore, it makes no sense to take k-mers longer than 50, as they will very often contain a sequencing error and thus will not match any allele. On the other hand, k-mers that are quite short become less specific, causing more overlap between k-mer profiles.
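The trade-off between k and the sequencing error rate can be made concrete with a small calculation. The sketch below assumes that errors occur independently at each base with the same probability, which is a simplification not stated in the text; the function name is hypothetical. Under that assumption, a 50-base k-mer at a 1% error rate is error-free with a probability of about 0.61.

```python
def error_free_probability(k: int, error_rate: float) -> float:
    """Probability that k consecutive bases contain no sequencing error.

    Assumes independent, identically distributed per-base errors
    (an illustrative simplification).
    """
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    # Compare a few candidate k values at a 1% per-base error rate.
    for k in (19, 35, 50, 65):
        print(k, round(error_free_probability(k, 0.01), 3))
```

Shorter k-mers are more likely to be error-free, but, as noted above, they are also less specific.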
In the case of nucleic acids, the length k is preferably between 1 and 251 bases, more preferably between 19 and 65 bases, and most preferably about 35 bases. In the case of amino acids, the length k is preferably between 1 and 81, more preferably between 5 and 41 amino acids, and most preferably about 9 or 11 amino acids. Odd values of k are preferred.
The terms "typing" and "sub-typing" used here refer to a molecular biology technique based on DNA sequence analysis that identifies, classifies and compares organisms and their subtypes.
The term "read sequence set" usually refers to a multitude of read sequences, i.e. two or more read sequences, but in the case of assembled sequences, that set may consist of only one read sequence.
The sequence analysis strategy of the method of the present invention is based on a gene-by-gene approach. To start this analysis strategy, a frame of reference [43] is first created. This is based on a set of genomes, in which a number of contiguous regions (usually coding sequences or genes) are chosen from each genome and arranged in consistent sets (each such set being called a locus), and for each locus the list of variants found for this locus (called alleles) is recorded. After such a frame of reference has been established, samples are compared based on allele variation: two samples have a degree of similarity depending on the number of loci present in both samples and having the same allele for that locus. The crucial point of the gene-by-gene approach to the analysis is therefore to determine for each locus whether it is present in the sample and, if so, in which allele variant. In genome sequence information in any form, the method of the invention detects i) the presence of known variants of a locus in the genome data, and ii) the presence of a locus, in that it predicts the presence of hitherto unknown variants of a locus in the genome data.
On the basis of the known variants of a locus and the genome sequence information of a sample, the method according to the invention answers the following questions: i) which of the known variants is/are present in the genome sequence information of the sample; ii) which of the loci is/are present in the genome sequence information of the sample; and iii) if none of the known variants is present in the sample: is an unknown but (in terms of sequence identity) closely related sequence of nucleotides or amino acids present in the genome sequence information of the sample?
For each locus, the method according to the invention first calculates a profile of the occurrence of the k-mers in this locus. A k-mer is a short sequence of nucleotides or amino acids of length k, and can be thought of as a word of length k. The word usage of a locus determines its characteristics, and makes it largely possible to distinguish one locus from another. A k-mer profile is also calculated for each variant in a locus. Here, too, the word usage of an allele determines its characteristics and makes it largely possible to distinguish one allele from another.
The presence of a k-mer profile for a specific allele in the sample genome sequence information indicates the presence of the allele. If no allele-specific k-mer profile can be found, a good representation of the k-mer profile of the entire locus indicates the presence of an as yet unknown allele for this locus.
In one embodiment, in the system according to any of the embodiments set forth herein, said system comprises one or more of the following: a personal computer, a portable computer, a laptop computer, a netbook computer, a tablet computer, a smartphone, a digital photo camera, a video camera, a mobile communication device, a personal digital assistant, a scanner or a multifunctional device.
In one embodiment, in the non-volatile storage medium or computer program product according to any of the embodiments set forth herein, the computer device is selected from a personal computer, a portable computer, a laptop computer, a netbook computer, a tablet computer, a smartphone, a digital photo camera, a video camera, a mobile communication device, a personal digital assistant, a scanner and a multifunctional device.
Description of the algorithm
Definition of alphabets, words, k-mer space and k-mer profiles
Suppose Σ is an alphabet consisting of s letters. The alphabets we are primarily interested in (but not limited to) are:
the DNA nucleotide alphabet Σ_NT = {A, C, G, T},
the RNA nucleotide alphabet Σ_NT = {A, C, G, U}, and
the amino acid alphabet Σ_AA = {G, P, A, V, L, I, M, C, F, Y, W, H, K, R, Q, N, E, D, S, T}.
A word w on an alphabet Σ is a sequence of letters from the alphabet Σ. For every k ∈ N, a k-mer is a word of length k. The set of all k-mers on the alphabet Σ is called the k-mer set, and it consists of s^k words.
For example, if we define the length of a k-mer as k = 2, in the case of DNA nucleotide sequences, since DNA has 4 different base types (A, C, T, G), a k-mer with k = 2 yields the following possible combinations (4^k = 4^2 = 16):
AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT.
When working in the k-mer set based on the nucleotide alphabet Σ_NT, we must consider forward and reverse sequences, as some of the combinations are equivalent to others: for example, AC (in the direction 5'->3') is equivalent to GT (in the direction 3'->5').
Therefore, for a k-mer with k = 2, the number of possible combinations is reduced to the following:
AA/TT, AC/GT, AG/CT, AT, CA/TG, CC/GG, CG, GA/TC, GC, TA.
Therefore, it is useful to look at the quotient space under the equivalence relation defined by reverse complementation. The reverse complement of a word on the nucleotide alphabet Σ_NT is the reversed version (that is, read from back to front) of the original word, in which each letter is replaced by its complement, that is:
A -> T, C -> G, G -> C, T -> A.
For example, the reverse complement of the word ACAGTCA is TGACTGT. The quotient space can be seen in terms of its representative elements (one for each equivalence class). A simple way to choose a representative element for each equivalence class would be to choose the lexicographically smallest word in each class. In the example, ACAGTCA would be the representative element for the {ACAGTCA, TGACTGT} class.
The reason for considering this equivalence relation is that the input data from next generation sequencing devices is usually not oriented, in the sense that the same genome location can be read as one particular word, but also as its reverse complement. That is of course not the case with amino acid sequences. Therefore, throughout the rest of this text, the term "k-mer set" will be used to indicate either the actual k-mer set on the alphabet in question, or, in the case that the alphabet is the nucleotide alphabet, the quotient space of the actual k-mer set under the equivalence relation of reverse complementation.
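The reverse complement, the choice of a representative element, and the resulting quotient space can be sketched in a few lines. This is a minimal illustration for the DNA alphabet only; the function names are hypothetical and are not taken from the patent.

```python
from itertools import product

# Complement mapping for the DNA nucleotide alphabet.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word: str) -> str:
    """Read the word from back to front and complement each letter."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def canonical(word: str) -> str:
    """Representative element: the lexicographically smallest of the
    word and its reverse complement."""
    return min(word, reverse_complement(word))


def canonical_kmer_set(k: int) -> list:
    """All representative elements of the quotient space for length k."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


if __name__ == "__main__":
    print(reverse_complement("ACAGTCA"))
    print(canonical_kmer_set(2))
```

For k = 2 this yields the ten classes listed above. For odd k no word equals its own reverse complement, so the quotient space has exactly half the size of the full k-mer set.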
The k-mer space for the alphabet Σ is the multidimensional real vector space that has one dimension for each k-mer in the k-mer set.
Suppose w is a word on an alphabet Σ with a length of at least k. The k-mer profile of the word w is a vector in the k-mer space, where each number in the vector indicates how many times the k-mer corresponding to that dimension has been used in w. For example, the word ACAGTCA on the alphabet Σ_NT has the following 2-mer profile:
AA/TT  AC/GT  AG/CT  AT  CA/TG  CC/GG  CG  GA/TC  GC  TA
0      2      1      0   2      0      0   1      0   0
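The 2-mer profile of ACAGTCA can be reproduced with a short sketch that treats the profile as a dense vector with one coordinate per equivalence class. The ordering of the coordinates (lexicographic by representative element) and the function names are assumptions made for illustration.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical(word: str) -> str:
    """Lexicographically smallest of a word and its reverse complement."""
    rc = "".join(COMPLEMENT[base] for base in reversed(word))
    return min(word, rc)


def kmer_axes(k: int) -> list:
    """One axis of the k-mer space per equivalence class, sorted."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def kmer_profile_vector(word: str, k: int):
    """Return (axes, vector): how often each class is used in the word."""
    axes = kmer_axes(k)
    position = {kmer: i for i, kmer in enumerate(axes)}
    vector = [0] * len(axes)
    for i in range(len(word) - k + 1):
        vector[position[canonical(word[i:i + k])]] += 1
    return axes, vector


if __name__ == "__main__":
    axes, vector = kmer_profile_vector("ACAGTCA", 2)
    for kmer, count in zip(axes, vector):
        print(kmer, count)
```

A dense vector is only practical for very small k; for realistic values of k a sparse mapping from k-mer to count is the more natural representation.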
Definition of gene-per-gene systems, loci and alleles
In the context of gene-per-gene systems, the generic terminology introduced above is often supplemented with a number of synonyms and additional concepts.
Suppose Σ is an alphabet. An allele a on the alphabet Σ is a word on the alphabet Σ. For example: the allele
TTAGAGCGCGCTGATATCGGTATTGACGCTAAAGCCGCGATCGAGGCTGACGCT
GTTGCCCGCCGCGTCGCTCACGACTGCCGTTACGGTGTAAGTGCCGTTGGTAAG
GCCTGCAAGGGTGTCGGCAGGCAGATTGACTTCCCAGCGGCCTGCGGAATCCAC is an allele derived from a nucleotide gene-per-gene system for the bacterial genus Cronobacter.
A locus l is a collection of alleles (variants) of a sequence such as the one above:
l = {a_1, a_2, ...}.
For example, the locus l corresponding to the ycbC gene of the bacterial genus Cronobacter consists of 15 alleles,
TCACTTCTGCCCTGGGTCGCCTGAGCCCACGCCTTTTACC ..., TCACTTCTGCCCTGGGTCGCCTGAGCCCACGCCTTTTACC ...,
TCACTCCTTCCCTGGGTCGCCTGATGCCACGCCTTTCATC ...
Only the first, second and 15th alleles are shown here, and for each allele only the first 40 bases are included.
A gene-per-gene system is a collection of loci,
L = {l_1, l_2, ..., l_n}.
For example, for the bacterial genus Cronobacter, there is a gene-per-gene system consisting of 11,168 loci.
As can be seen from the above example, a locus does not usually group together a random set of alleles. There are criteria that determine which alleles are placed in the same locus, and which loci are included in a gene-per-gene system. For example, a criterion for accommodating an allele in a locus could be the minimum pairwise word distance between the candidate allele and the known alleles already present in the locus, taking into account an optimal alignment of each known allele with the candidate allele. In general, these criteria are based on biological principles.
The input data for the algorithm is i) a gene-per-gene system (represented as the set of loci, and thus its set of alleles) on an alphabet Σ; and (ii) a sample, represented as a collection of words on the same alphabet Σ.
Step 1: constructing the reference data for each allele and locus
A k-mer profile is calculated for each of the alleles in each of the loci in the gene-per-gene system. For each allele, the set of k-mers used in that allele is stored. In other words, we construct an allele/k-mer incidence matrix. For example, for allele 1 we record, for every k-mer in the list of all possible k-mers, whether the allele contains it.
Allele 1: A C A G T C A
Table 1
AA 0   AC 1   AG 1   AT 0
CA 2   CC 0   CG 0   CT 0
GA 0   GC 0   GG 0   GT 1
TA 0   TC 1   TG 0   TT 0
Table 2
AA/TT 0   AC/GT 2   AG/CT 1   AT 0   CA/TG 2
CC/GG 0   CG 0      GA/TC 1   GC 0   TA 0
Table 3
Table 1 shows the number of occurrences of each of the 16 2-mers for allele 1. Table 2 takes into account the equivalence of 2-mers in the forward and reverse directions. In the incidence matrix, the matrix element (k-mer, allele) is a number (1, 2, ...) giving the number of times the allele contains that specific k-mer, and 0 otherwise. Since a full matrix of this kind, with on the order of 4^k/2 rows, would have too many entries, at least too many to fit in the memory of a computer, and since many of these entries are 0, a more efficient representation is obtained by recording only the k-mers that actually occur.
Therefore, we reduce the matrix to the one in Table 3 above.
Table 3 presents the incidence matrix showing the number of times each k-mer occurs in allele 1 (the k-mers that never occur have been removed to save computer memory).
Since one locus has several known alleles (variants), we construct the incidence matrix for each known allele.
We obtain a vector of strings and occurrence counts for each allele. For example, for alleles 1-3 of locus 1, the vector will have the following expression (the vector can be constructed for as many alleles and loci as desired):
Locus 1
  Allele 1   AC/GT: 2   AG/CT: 1   CA/TG: 2   GA/TC: 1              ACAGTCA
  Allele 2   AC/GT: 2   AG/CT: 1   CA/TG: 1   CC/GG: 1   GA/TC: 1   ACAGTCC
  Allele 3   AC/GT: 2   AG/CT: 2   CA/TG: 1   GA/TC: 1              ACAGTCT
Locus 2
  Allele 1
  Allele 2
  Allele 3
  Allele 4
Locus 3
  Allele 1
  Allele 2
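The sparse reference data of Step 1 can be sketched as a mapping from each (locus, allele) pair to the counts of the k-mers that actually occur in it. The inverted index from k-mer to alleles is an addition of this sketch, not something the text describes; all names are hypothetical.

```python
from collections import Counter, defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical(word: str) -> str:
    """Lexicographically smallest of a word and its reverse complement."""
    rc = "".join(COMPLEMENT[base] for base in reversed(word))
    return min(word, rc)


def kmer_profile(sequence: str, k: int) -> Counter:
    """Sparse k-mer profile: only k-mers that occur are stored."""
    return Counter(canonical(sequence[i:i + k])
                   for i in range(len(sequence) - k + 1))


def build_reference(loci: dict, k: int):
    """Return per-allele profiles and a k-mer -> alleles lookup."""
    reference = {}
    kmer_index = defaultdict(set)  # assumption: handy for later screening
    for locus, alleles in loci.items():
        for allele, sequence in alleles.items():
            profile = kmer_profile(sequence, k)
            reference[(locus, allele)] = profile
            for kmer in profile:
                kmer_index[kmer].add((locus, allele))
    return reference, kmer_index


# Toy data taken from the example above.
loci = {"locus 1": {"allele 1": "ACAGTCA",
                    "allele 2": "ACAGTCC",
                    "allele 3": "ACAGTCT"}}
reference, kmer_index = build_reference(loci, k=2)

if __name__ == "__main__":
    for key, profile in reference.items():
        print(key, dict(profile))
```

Each class is keyed by its representative element here, so the entry written AC/GT above appears under the key AC.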
Step 2: screening of the sample data (reads) using the reference k-mer sets
For each of the k-mers used in each of the alleles, it is determined how many times this k-mer occurs in the words of the sample data, i.e., the reads. If the sample data contains not only words but also individual quality scores for each letter in a word, only high-quality k-mers in the reads are used. In particular, k-mers can be rejected if too many low-quality letters are present in one k-mer. An example of quality scores for bases is, but is not limited to, the PHRED score, which indicates the probability that a base has been called incorrectly. For example, a quality score of 20 indicates that the probability of the base being miscalled is 1/100. In other words, among 100 bases with quality 20, one can be expected to be wrong. With that in mind, these quality scores can be used to determine whether a k-mer is reliable or not. With a threshold for the minimum quality (for example 20) and a threshold for the maximum number of violations (for example 3), a k-mer is accepted if it contains at most 3 bases with a quality of less than 20.
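The quality rule described above can be sketched as follows. The relation between a PHRED score Q and the error probability, P = 10^(-Q/10), is the standard definition; the function names and the representation of qualities as a list of integers are assumptions of this sketch.

```python
def phred_error_probability(quality: float) -> float:
    """Standard PHRED relation: probability that the base call is wrong."""
    return 10 ** (-quality / 10)


def reliable_kmers(read: str, qualities: list, k: int,
                   min_quality: int = 20, max_violations: int = 3):
    """Yield the k-mers of a read that contain at most `max_violations`
    bases with a quality below `min_quality`."""
    for i in range(len(read) - k + 1):
        window = qualities[i:i + k]
        low_quality_bases = sum(1 for q in window if q < min_quality)
        if low_quality_bases <= max_violations:
            yield read[i:i + k]


if __name__ == "__main__":
    read = "ACAGTCA"
    qualities = [30, 30, 10, 10, 10, 10, 30]
    print(list(reliable_kmers(read, qualities, k=5)))
```

In this toy read only the first 5-mer is kept, because the two later windows each contain four bases below the quality threshold.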
Usually, the presence or absence of the allele-associated k-mer set within the read sequences can be determined by scanning each of these read sequences with the k-mer set. For a set of nucleotide sequences, also called reads, which may be unassembled, assembled into a single sequence, or partially assembled into a few sequences, we in practice construct a search for allele 1 (ACAGTCA) with the boolean operator 'AND':
AC / GT: 2 AND AG / CT: 1 AND CA / TG: 2 AND GA / TC: 1
The software performs the search on each read, comparing sequences for exact matches of k-mers. If the k-mer set is found in the read sequences, it is scored "true" and counted, and the one or more matching alleles of the one or more loci are displayed.
Allele 1:  AC/GT: 2    AG/CT: 1    CA/TG: 2    GA/TC: 1
Reads:     AC/GT: 69   AG/CT: 34   CA/TG: 60   GA/TC: 33
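Counting the reference k-mers in the reads can be sketched as a single pass over every read. The sketch keeps separate forward and reverse counts per equivalence class, since the later filtering step looks at both directions; treating a k-mer as "forward" when it equals its representative element is a convention chosen here, and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical(word: str) -> str:
    """Lexicographically smallest of a word and its reverse complement."""
    rc = "".join(COMPLEMENT[base] for base in reversed(word))
    return min(word, rc)


def screen_reads(reads, reference_kmers: set, k: int):
    """Count exact occurrences of reference k-mers in the reads.

    Returns (forward, reverse): occurrences seen as the representative
    element itself and as its reverse complement, respectively.
    """
    forward, reverse = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            representative = canonical(word)
            if representative not in reference_kmers:
                continue  # k-mer is not used by any reference allele
            if word == representative:
                forward[representative] += 1
            else:
                reverse[representative] += 1
    return forward, reverse


if __name__ == "__main__":
    # One read in each orientation of allele 1 (toy data).
    reads = ["ACAGTCA", "TGACTGT"]
    forward, reverse = screen_reads(reads, {"AC", "AG", "CA", "GA"}, k=2)
    print(dict(forward), dict(reverse))
```

A k-mer that is its own reverse complement is always counted as forward under this convention, which a strand-aware filter has to take into account.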
Step 3: filtering
Quality control measures are included in the method of the invention, so that an unknown allele is disregarded when poor sequence quality is likely to have caused sequencing errors.
The k-mers are filtered based on the number of times they occur in the entire sample. In particular, all k-mers that occur fewer times than a predetermined threshold value are rejected. When the nucleotide alphabet is used, equivalence classes of k-mers for which not all equivalent k-mers are present in the sample data can also be rejected. In other words, a k-mer (or, in this case, an equivalence class containing at most 2 k-mers) is considered to be present if and only if all of its elements can be found in the reads. That is, each k-mer must be observed in both the forward and the reverse direction. This is done to avoid sequencing errors that occur in only one strand. In the event of such an error, a wrong k-mer would be observed in one direction but never in the other, and it would consequently be rejected.
The filtering step in the method of the present invention is essentially a quality control step, and can take many forms selected from:
i) calculating the coverage, which is the number of times the same k-mer has been observed, and keeping the k-mer if its coverage is higher than a predetermined threshold (which is set to 3 by default, but depends on the type of input data, which may be short reads or assembled or partially assembled sequences);
ii) calculating the coverage of the individual elements of the equivalence class of a k-mer, and keeping the k-mer if the coverage of all individual elements is higher than a predetermined threshold (which defaults to 1, but depends on the type of input data, which may be short reads or assembled or partially assembled sequences).
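The two filters can be sketched together as a function that resets the count of a k-mer to 0 when it fails either test. Two points are assumptions of this sketch rather than statements of the text: a count is kept when it is at least the threshold (the text says "higher than", the abstract says reset when "below"), and the per-strand test is skipped for k-mers that are their own reverse complement, since both directions are indistinguishable for them.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(word))


def filter_counts(forward: dict, reverse: dict,
                  min_total: int = 3, min_per_strand: int = 1) -> dict:
    """Return total counts per k-mer, reset to 0 if a filter fails."""
    filtered = {}
    for kmer in set(forward) | set(reverse):
        f = forward.get(kmer, 0)
        r = reverse.get(kmer, 0)
        total = f + r
        # Assumption: self-complementary k-mers only face the total test.
        palindromic = kmer == reverse_complement(kmer)
        strands_ok = palindromic or (f >= min_per_strand and r >= min_per_strand)
        filtered[kmer] = total if (total >= min_total and strands_ok) else 0
    return filtered


if __name__ == "__main__":
    forward = {"AC": 40, "CG": 29, "GA": 2, "AG": 1}
    reverse = {"AC": 28, "AG": 1}
    print(filter_counts(forward, reverse))
```

In this toy input, GA is rejected because it was never seen in the reverse direction, and AG because its total coverage of 2 is below the default threshold of 3.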
Step 4: allele detection
An allele is predicted to be present in the sample data if and only if (iff) all k-mers in the k-mer set for that allele can be found in the sample data (after filtering).
If the k-mer set is not found in the read sequences, it is assigned the score 'false', and the short reads showing a sequence identity of at least X% with the queried allele are taken and reported as an unknown allele that is related to a known allele.
In particular, the following questions are answered:
Are all specific combinations of "words" (k-mers) of allele 1 present in the reads, and do they occur an equal number of times?
If yes, allele 1 is present.
If not, the next question is: which ones do not match? For example:
Allele 1:  A C A G T C A
Read:      A C A G T C G

Allele 1:  AC/GT: 2    AG/CT: 1    CA/TG: 2    GA/TC: 1
Reads:     AC/GT: 68   AG/CT: 34   CA/TG: 30   CG: 29   GA/TC: 33
Although the four k-mers of allele 1 are present in the reads, the reads additionally contain a fifth k-mer, CG, and the k-mer CA/TG is present at half the proportion expected from allele 1. This indicates that we are probably dealing with an unknown allele, which subsequently allows us to predict a new allele on the basis of sequence similarity.
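The detection rule and the follow-up questions can be sketched with the numbers from the example above. The presence test follows the stated rule (every k-mer of the allele must survive filtering). The coverage-per-copy comparison and the listing of extra k-mers are illustrative helpers with hypothetical names, meant only to show how the mismatch in the example could be spotted.

```python
def allele_present(allele_profile: dict, filtered_counts: dict) -> bool:
    """Stated rule: all k-mers of the allele are found after filtering."""
    return all(filtered_counts.get(kmer, 0) > 0 for kmer in allele_profile)


def coverage_per_copy(allele_profile: dict, filtered_counts: dict) -> dict:
    """Observed count divided by the number of copies in the allele.
    Similar values across k-mers are expected for a true match."""
    return {kmer: filtered_counts.get(kmer, 0) / copies
            for kmer, copies in allele_profile.items()}


def extra_kmers(allele_profile: dict, filtered_counts: dict) -> list:
    """Observed k-mers that the allele itself does not contain."""
    return sorted(kmer for kmer, count in filtered_counts.items()
                  if count > 0 and kmer not in allele_profile)


# Numbers from the example above, keyed by representative element.
allele_1 = {"AC": 2, "AG": 1, "CA": 2, "GA": 1}
sample = {"AC": 68, "AG": 34, "CA": 30, "CG": 29, "GA": 33}

if __name__ == "__main__":
    print(allele_present(allele_1, sample))
    print(coverage_per_copy(allele_1, sample))
    print(extra_kmers(allele_1, sample))
```

Here all four k-mers of allele 1 are found, but CA/TG has roughly half the per-copy coverage of the others and CG appears in addition, which is the pattern the text interprets as a probable unknown allele. The extra-k-mer check is only meaningful when the counts cover k-mers beyond those of the queried allele.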
Step 5: presence of locus
A locus is predicted to be present in the sample data if and only if (iff), for at least one of its alleles, the number of k-mers in the k-mer set for that allele that can be found in the sample data (after filtering) exceeds a predetermined threshold.
To determine the presence or absence of one or more loci containing one or more alleles in a set of reads, which may be unassembled, assembled into a single sequence, or partially assembled into a few sequences, we similarly combine all the strings that match for each allele with the boolean operator 'AND', and we combine the queries for all the alleles of one locus with the boolean operator 'OR', so that a k-mer profile, or an allele-associated k-mer set per locus, is obtained.
For example, for locus 1, which contains alleles 1 (ACAGTCA), 2 (ACAGTCC), and 3 (ACAGTCT), we compose a query constructed with the boolean operators "AND" and "OR", as follows:
AC / GT: 2 AND AG / CT: 1 AND CA / TG: 2 AND GA / TC: 1 OR
AC / GT: 2 AND AG / CT: 1 AND CA / TG: 1 AND CC / GG: 1 AND GA / TC: 1 OR
AC / GT: 2 AND AG / CT: 2 AND CA / TG: 1 AND GA / TC: 1
The software performs the search on each read, comparing sequences to the queried k-mer profile for locus 1.
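The locus rule and the AND/OR query can be sketched side by side. `locus_present` implements the threshold rule of Step 5, and `matching_alleles` evaluates the boolean query (AND over the k-mers of an allele, OR over the alleles of the locus). The function names, the threshold value, and the sample counts are illustrative assumptions.

```python
def found_kmers(allele_profile: dict, filtered_counts: dict) -> int:
    """Number of k-mers of the allele that are found after filtering."""
    return sum(1 for kmer in allele_profile if filtered_counts.get(kmer, 0) > 0)


def locus_present(locus_profiles: dict, filtered_counts: dict,
                  threshold: int) -> bool:
    """True if, for at least one allele, the number of found k-mers
    exceeds the threshold."""
    return any(found_kmers(profile, filtered_counts) > threshold
               for profile in locus_profiles.values())


def matching_alleles(locus_profiles: dict, filtered_counts: dict) -> list:
    """Alleles whose k-mers are all found (AND); the locus query is the
    OR over these per-allele results."""
    return [allele for allele, profile in locus_profiles.items()
            if all(filtered_counts.get(kmer, 0) > 0 for kmer in profile)]


# Locus 1 from the example, keyed by representative element.
locus_1 = {
    "allele 1": {"AC": 2, "AG": 1, "CA": 2, "GA": 1},
    "allele 2": {"AC": 2, "AG": 1, "CA": 1, "CC": 1, "GA": 1},
    "allele 3": {"AC": 2, "AG": 2, "CA": 1, "GA": 1},
}
sample = {"AC": 68, "AG": 34, "CA": 30, "CG": 29, "GA": 33}

if __name__ == "__main__":
    print(locus_present(locus_1, sample, threshold=3))
    print(matching_alleles(locus_1, sample))
```

With only 2-mers, alleles 1 and 3 use the same set of k-mers and cannot be told apart by presence alone; this is an artifact of the toy value k = 2, and longer k-mers make the per-allele sets far more specific.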
Applications
The method of the invention can be used to (sub)type any kind of organism, be it prokaryotic or eukaryotic. The scientific literature reports several successful attempts to implement a gene-by-gene strategy for the analysis and comparison of biological samples across a wide range of organisms [25,26,27]. Moreover, the gene-by-gene methodology has been used for many years, albeit on a smaller scale and using first-generation Sanger sequencing, in multi-locus sequence typing for bacteria [23,24].
The method of the invention can be applied to samples containing only one biological specimen (which are referred to as cultured samples) or to samples containing many different biological specimens (which are referred to as uncultivated samples). In the first case, the data generated by next generation sequencing from such a sample is usually referred to as whole genome sequence data (WGS sequence data), and in the second case, the data generated by next generation sequencing from such a sample is commonly called shotgun metagenome data. In addition, the method of the invention can also be applied to expression sequence data such as RNA-seq or ChIP-seq.
The method of the invention can be applied to the analysis of sequence data of any type and origin. The sequence data can be nucleotide or amino acid type sequence data. The data may come from, but is not limited to, so-called i) first generation sequencing technologies (e.g., Sanger sequencing); ii) next generation sequencing technologies (e.g., Genome Analyzer from Illumina, HiSeq, MiSeq, NextSeq, PGM from Ion Torrent, Pacific Biosciences); iii) third generation sequencing technologies (e.g., Oxford Nanopore); or iv) any other current or future technology that yields nucleotide or amino acid sequences.
In one embodiment, the sequence reads can be obtained from whole genome sequence data. Reads are available over the Internet from databases such as SRA (Short Read Archive) from NCBI (National Center for Biotechnology Information), ENA (European Nucleotide Archive) from EBI-EMBL (The European Bioinformatics Institute, part of the European Molecular Biology Laboratory), and the DDBJ Sequence Read Archive (DNA Data Bank of Japan).
There are also websites that provide MLST information, such as http://www.mlst.net and http://pubmlst.org, which provide databases for different types of pathogens, and there are now also sites for other types, for example from University College Cork (http://mlst.ucc.ie) and Institut Pasteur (http://www.pasteur.fr/mlst). These websites provide means to perform searches in the databases and to visualize the relationships between new isolates and the isolates present in the databases. The relationship between isolates characterized by MLST and other molecular typing methods is typically visualized through clustering approaches, using, in the case of MLST, the differences in the allele profiles of isolates, such that the isolates most closely resembling a query isolate can be identified in a tree structure.
The typing methods of the present invention can be used for a variety of purposes, for example, to understand the phylogeny (evolution) and genetics of bacterial populations; to identify specific strains that spread worldwide in specific populations and/or core groups; to identify time-related and geographical changes in strain types and the emergence and transmission of individual strains; to check similarities and differences between strains in contact tracing or in treatment testing; to confirm or disprove treatment failure; to resolve medico-legal issues such as sexual abuse; and to confirm suspected epidemiological links or distinguish isolates from suspicious clusters and outbreaks. Strain typing, coupled with antimicrobial susceptibility data, contributes to a better understanding of the transmission of specific antibiotic resistant strains. Ultimately, such information can be used to design new preventive measures and interventions for public health.
The typing methods of the present invention can be used for accurate and reliable research in the macro-epidemiology (long-term global epidemiology) of infections caused by microorganisms, population dynamics over many years or decades, and phylogeny (evolution).
The methods of the present invention can be used in any single or multi-locus sequence typing scheme using non-targeted genome sequence information (targeted genome sequence information is already de facto associated with a specific locus, so does not require further locus or allele detection).
In particular, the method of the invention can be used, without being in any way limited to it, in multi-locus sequence typing for any bacterial organism, including traditional multi-locus sequence typing (MLST) as described by Maiden et al. [23]; extended multi-locus sequence typing (eMLST) as described by Didelot et al. [24]; ribosomal multi-locus sequence typing (rMLST) as described by Jolley et al. [25]; whole genome multi-locus sequence typing (wgMLST, MLST+) as described by Cody et al. [26] or Jolley et al. [27]; cgMLST for MTBC [41]; cgMLST for MRSA [42]; and any other typing scheme constructed according to the principles in the publications cited above.
The method of the invention can also be used to detect the presence or absence of specific sequences of nucleotides or amino acids, such as spoligotyping for Mycobacterium tuberculosis as described by Groenen et al. [28] or Kamerbeek et al. [29]; or large sequence polymorphism (LSP) detection for Mycobacterium tuberculosis as described by Gagneux et al. [30], Kato-Maeda et al. [31], Tsolaki et al. [32], Hirsh et al. [33], or Fleischmann et al. [34].
The method of the invention can also be used to detect single base variants whose location on the genome is determined by its flanking regions, such as Taqman®-based SNP analysis.
The method of the invention can also be used for single locus sequence typing (or allele typing), including, for example, rifampicin resistance: rpoA, rpoB (Myco) (cf. Miller et al. [35] or Telenti et al [36]); M protein typing: emm (Spyo) (cf. Kaufhold et al. [37]); flab (Camp); spa (Saur); sipAst (Cdif); adhesion typing: FimH, and the like.
The method of the invention can also be used for antigen gene sequence typing (AGST) (cf. Colles and Maiden [38]); antibiotic resistance typing, e.g. Tellurium res (ter); antigen serotyping; SPATE typing (serine protease autotransporters of Enterobacteriaceae) (Ecoli, Shig); prediction of drug resistance in HIV, HBV, HCV, and the like.
Further applications include the micro-epidemiological analysis of strains, investigating the identity of isolates collected over short periods of time (days) up to a limited number of months, or even up to a maximum period of several years. This approach includes strain typing in the following cases: community epidemics; strains in a complete population for a limited time; strains from core groups, larger core groups, or sexual networks; identifying the emergence and transmission of individual (e.g. antimicrobial resistant) strains; confirming or distinguishing suspected epidemiological links in suspicious infection clusters; contact tracing, treatment testing, and solving medico-legal issues; and characterizing bacterial clones.
The following examples are intended to illustrate the present invention in more detail, and are not to be understood in the sense that they limit the invention thereto.
Examples
Example 1: Allele prediction in Campylobacter jejuni sequences
We used a publicly available gene-per-gene system for the bacterial species Campylobacter jejuni [26], and a publicly available sample set of 36 samples. This allowed us to compare the results obtained by the algorithm of the invention with the publicly available results.
Isolate ID ENA-access no. Isolate ID ENA-access no. Isolate ID ENA-access no. OXC6347 ERR083963 OXC6461 ERR084072 OXC6564 ERR108328 OXC6407 ERR084021 OXC6632 ERR 108394 OXC6531 ERR084142 OXC6592 ERR108356 OXC6524 ERR084135 OXC6266 ERR083883 OXC6487 ERR084098 OXC6543 ERR084154 OXC6598 ERR108362 OXC6600 ERR108364 OXC6615 ERR108378 OXC6286 ERR083902 OXC6449 ERR084061 OXC6285 ERR083901 OXC6448 ERR084060 OXC6636 ERR108398 OXC6331 ERR083947 OXC6520 ERR084131 OXC6571 ERR108335 OXC6459 ERR084070 OXC6423 ERR084037 OXC6565 ERR108329 OXC6275 ERR083892 OXC6567 ERR108331 OXC6590 ERR108354 OXC6530 ERR084141 OXC6457 ERR084068 OXC6393 ERR084009 OXC6251 ERR083868 OXC6574 ERR108338 OXC6527 ERR084138 OXC6604 ERR108368 OXC6542 ERR084153
Table 1. Samples used in the example. The isolate IDs refer to the publicly accessible BIGSdb isolate database; the sequence data can be found in the European Nucleotide Archive (ENA).
Allele prediction was performed using the algorithm of the present invention, which took about 1 minute per sample on standard hardware, while the de novo procedure used in BIGSdb took about 15 to 20 minutes on the same hardware. For the 36 samples and the 58259 alleles found in these samples, we found, among the 58259 uniquely called loci (where 'uniquely' means that a single allele was found for each of the 58259 loci), 2 loci (approximately 0.003%) that conflict between our results and the published allele calls. The differences in allele calls can be attributed to ambiguous positions that were resolved in an idiosyncratic manner in the de novo assembly process of BIGSdb.
Looking at isolate OXC6347, and in particular at the locus CAMP1442, we saw that it was called as allele 69 by BIGSdb and as allele 1 by our algorithm. Comparing allele 69 with allele 1, we established a difference of one base between the two (T in allele 1 and G in allele 69); see Figure 1.
When we next looked at the de novo assembled sequence around position 82126, and at the BLAST alignment of this sequence with allele 69 (left) and allele 1 (right),
[Alignment: de novo assembled sequence from position 82115, aligned from allele position 409 with allele 69 as called by BIGSdb (left) and with allele 1 as called by the present invention (right)]
we saw complete agreement with allele 69 and one mismatch with allele 1. This indicated that the allele in the genome assembled de novo by BIGSdb was actually allele 69, not allele 1. However, when we subsequently mapped the reads back to the de novo assembled genome and looked at the same position 82126, we saw that this was an ambiguous position, with G appearing as the base in some reads and T in others; see Figure 2.
In particular, we had a coverage of 85, with a G in 28 reads and a T in 57 reads. All the reads with a G were mapped in the forward direction, while the reads with a T were mapped in both directions (24 forward and 33 reverse). We also noticed that there was another ambiguous base at position 82144 (coverage 94, with 27 Gs and 67 Ts, with all Gs in the forward direction, while the Ts were spread across the forward and reverse directions). When we looked at the reads that span both ambiguous positions,
Bases at 82126 ... 82144 | Number of reads | Allele
G ... G | 16 | unknown allele
G ... T | 1 | allele 69
T ... G | 11 | unknown allele
T ... T | 37 | allele 1
we saw that all combinations had more than 10 reads except the combination G ... T, which occurred only once. Since the algorithm's minimum coverage parameters were set to 3 for the total coverage and to 1 for the forward and the reverse coverage, allele 69 was not called, whereas allele 1 was called because it was the only known allele supported in both the forward and the reverse direction.
Both the BIGSdb method and the method of the present invention (with a k-mer based approach) work with very high accuracy, but this example demonstrates that using de novo assembly without correcting the base calls after assembly carries a risk. Since the algorithm of the present invention goes directly back to the reads, no such confusion arises.
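The coverage filter at work in this example can be illustrated with the base counts reported for position 82126. The following is a minimal sketch and not the patented implementation: the names `StrandCount` and `filtered_total` are hypothetical, and per-base counts are used here as a simplified stand-in for the k-mer counts that the algorithm filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrandCount:
    """Number of supporting reads per direction (hypothetical helper type)."""
    forward: int
    reverse: int

    @property
    def total(self) -> int:
        return self.forward + self.reverse


def filtered_total(count, min_total=3, min_forward=1, min_reverse=1):
    """Return the total count, reset to 0 if any of the three thresholds is missed."""
    if (count.total < min_total
            or count.forward < min_forward
            or count.reverse < min_reverse):
        return 0
    return count.total


# Counts at position 82126 as reported in Example 1:
# G in 28 reads (all forward), T in 57 reads (24 forward, 33 reverse).
observed = {
    "G": StrandCount(forward=28, reverse=0),
    "T": StrandCount(forward=24, reverse=33),
}
filtered = {base: filtered_total(count) for base, count in observed.items()}
print(filtered)  # {'G': 0, 'T': 57}
```

With the thresholds from the example (3 in total, 1 per direction), the G is reset to 0 because it has no support in the reverse direction, while the T is retained.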
Example 2: reproducibility and variation within a patient
We will now demonstrate that the algorithm of the present invention provides improved accuracy and sensitivity compared to the de novo assembly based method. To this end, we again used the publicly accessible gene-by-gene scheme for the bacterial species Campylobacter jejuni [26], and two publicly accessible sample sets, the first of which contained 10 samples obtained once from the patient but sequenced twice, and the second of which contained 17 samples obtained twice from the patient and sequenced twice.
In the 10 Campylobacter samples isolated once but sequenced twice, the algorithm of the present invention found no differences (apart from coverage effects). The de novo assembly procedure did show some variation (1 to 7 loci), especially for paralogous genes. This was due to small, almost random choices made by the assembly algorithm in repeated or nearly repeated regions, and to the lack of verification as to whether such a region was properly assembled.
The biological variation in Campylobacter jejuni isolates from the same patient ranged from 0 to 10 different alleles as identified by the algorithm of the invention, and from 0 to 13 different alleles as identified by the BIGSdb algorithm. On average, the variation was 0.07% for the data from our algorithm, against 0.22% for the BIGSdb data.
Figure 3 shows the number of mismatched alleles as identified by BIGSdb (black) and by the algorithm of the present invention (white).
This more than threefold increase in the number of mismatched alleles was mainly due to almost random choices made by the assembly algorithm in repeated or nearly repeated regions, and to the lack of verification as to whether such a region was properly assembled.
List of references to journals
1. Cohen SH, Tang YJ, Silva J Jr. Molecular typing methods for the epidemiological identification of Clostridium difficile strains. Expert Rev Mol Diagn. 2001; 1 (1): 61-70.
2. Spratt BG. The 2011 Garrod Lecture: From penicillin-binding proteins to molecular epidemiology. J Antimicrob Chemother 2012; 67: 1578-1588
3. Knetsch CW, Lawley TD, Hensgens MP, Corver J, Wilcox MW, Kuijper EJ. Current application and future perspectives of molecular typing methods to study Clostridium difficile infections. Euro Surveillance. 2013; 18 (4): pii = 20381.
4. Voth DE, Ballard JD. Clostridium difficile toxins: mechanism of action and role in disease. Clin Microbiol Rev. 2005; 18 (2): 247-63.
5. Ison, C. A., et al. 2003. International comparison of molecular typing methods for Neisseria gonorrhoeae, ab 364. Abstr. 15th Int. Soc. Sex. Transm. Dis. Res. Congr., Ottawa, Canada.
6. Perez-Losada, M., K. A. Crandall, J. Zenilman, and R. P. Viscidi. 2007. Temporal trends in gonococcal population genetics in a high prevalence urban community. Infect. Genet. Evol. 7: 271-278.
7. Perez-Losada, M., et al. 2007. Distinguishing importation from diversification of quinolone-resistant Neisseria gonorrhoeae by molecular evolutionary analysis. BMC Evol. Biol. 7:84.
8. Perez-Losada, M., R. P. Viscidi, J. C. Demma, J. Zenilman, and K. A. Crandall. 2005. Population genetics of Neisseria gonorrhoeae in a high-prevalence community using a hypervariable outer membrane porB and 13 slowly evolving housekeeping genes. Mol. Biol. Evol. 22: 1887-1902.
9. Tazi, L., et al. 2010. Population dynamics of Neisseria gonorrhoeae in Shanghai, China: a comparative study. BMC Infect. Dis. 10:13.
10. Magnus Unemo and Jo-Anne R. Dillon. Review and International Recommendation of Methods for Typing Neisseria gonorrhoeae Isolates and Their Implications for Improved Knowledge of Gonococcal Epidemiology, Treatment, and Biology. Clinical Microbiology Reviews, July 2011, p. 447-458
11. Pop & Salzberg 2008
12. Li et al. 2008
13. Langmead et al. 2009
14. Warren et al. 2007
15. Zerbino & Birney 2008
16. Simpson et al. 2009
17. Inouye M, Conway T, Zobel J, and Holt KE. Short read sequence typing (SRST): multi-locus sequence types from short reads. BMC Genomics 2012, 13:33
18. Coll F, Mallard K, Preston M, Bentley S, Parkhill J, McNerney R, Martin N, and Clark T. SpolPred: rapid and accurate prediction of Mycobacterium tuberculosis spoligotypes from short genomic sequences. Bioinformatics. 2012 November 15; 28 (22): 2991-2993.
19. Jolley K.A., Maiden M.C., BIGSdb: Scalable analysis of bacterial genome variation at the population level, BMC Bioinformatics. 2010 Dec 10; 11: 595. doi: 10.1186/1471-2105-11-595.
20. Hunter PR, Gaston MA. Numerical index of the discriminatory ability of typing systems: an application of Simpson's index of diversity. J Clin Microbiol. 1988; 26 (11): 2465-6.
21. van Belkum A, Tassios PT, Dijkshoorn L, Haeggman S, Cookson B, Fry NK, et al. Guidelines for the validation and application of typing methods for use in bacterial epidemiology. Clin Microbiol Infect. 2007; 13 Suppl 3: 1-46.
22. Nadon CA, Trees E, Ng LK, Møller Nielsen E, Reimer A, Maxwell N, Kubota KA, Gerner-Smidt P, the MLVA Harmonization Working Group. Development and application of MLVA methods as a tool for interlaboratory surveillance. Euro Surveillance. 2013; 18 (35): pii = 20565.
23. Maiden, MCJ, Bygraves, JA, Feil, E., Morelli, G., Russell, JE, Urwin, R., Zhang, Q., Zhou, J., Zurth, K., Caugant, DA, et al. (1998). Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proc. Natl. Acad. Sci. U. S. A. 95, 3140-3145.
24. Didelot, X., Urwin, R., Maiden, M.C.J., and Falush, D. (2009). Genealogical typing of Neisseria meningitidis. Microbiology 155, 3176-3186.
25. Jolley, KA, Bliss, CM, Bennett, JS, Bratcher, HB, Brehony, C., Colles, FM, Wimalarathna, H., Harrison, OB, Sheppard, SK, Cody, AJ, et al. (2012a). Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiol. Read. Engl. 158, 1005-1015.
26. Cody, A.J., McCarthy, N.D., Rensburg, M.J. van, Isinkaye, T., Bentley, S., Parkhill, J., Dingle, K.E., Bowler, I.C.J.W., Jolley, K.A., and Maiden, M.C.J. (2013). Real-time genomic epidemiology of human Campylobacter isolates using whole genome multilocus sequence typing. J. Clin. Microbiol.
27. Jolley, K.A., Hill, D.M.C., Bratcher, H.B., Harrison, O.B., Feavers, I.M., Parkhill, J., and Maiden, M.C.J. (2012b). Resolution of a Meningococcal Disease Outbreak from Whole-Genome Sequence Data with Rapid Web-Based Analysis Methods. J. Clin. Microbiol. 50, 3046-3053.
28. Groenen, P.M.A., Bunschoten, A.E., Soolingen, D. van, and Embden, J.D.A. van (1993). Nature of DNA polymorphism in the direct repeat cluster of Mycobacterium tuberculosis; application for strain differentiation by a novel typing method. Mol. Microbiol. 10, 1057-1065.
29. Kamerbeek, J., Schouls, L., Kolk, A., Agterveld, M. van, Soolingen, D. van, Kuijper, S., Bunschoten, A., Molhuizen, H., Shaw, R., Goyal, M., et al. (1997). Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J. Clin. Microbiol. 35, 907-914.
30. Gagneux, S., DeRiemer, K., Van, T., Kato-Maeda, M., de Jong, BC, Narayanan, S., Nicol, M., Niemann, S., Kremer, K., Gutierrez, MC, et al. (2006). Variable host pathogen compatibility in Mycobacterium tuberculosis. Proc. Natl. Acad. Sci. U. S. A. 103, 2869-2873.
31. Kato-Maeda, M., Rhee, J.T., Gingeras, T.R., Salamon, H., Drenkow, J., Smittipat, N., and Small, P.M. (2001). Comparing Genomes within the Species Mycobacterium tuberculosis. Genome Res. 11, 547-554.
32. Tsolaki, A.G., Hirsh, A.E., DeRiemer, K., Enciso, J.A., Wong, M.Z., Hannan, M., Salmoniere, Y.-O.L.G. de la, Aman, K., Kato-Maeda, M., and Small, P.M. (2004). Functional and evolutionary genomics of Mycobacterium tuberculosis: Insights from genomic deletions in 100 strains. Proc. Natl. Acad. Sci. U. S. A. 101, 4865-4870.
33. Hirsh, A.E., Tsolaki, A.G., DeRiemer, K., Feldman, M.W., and Small, P.M. (2004). Stable association between strains of Mycobacterium tuberculosis and their human host populations. Proc. Natl. Acad. Sci. U. S. A. 101, 4871-4876.
34. Fleischmann, RD, Alland, D., Eisen, JA, Carpenter, L., White, O., Peterson, J., DeBoy, R., Dodson, R., Gwinn, M., Haft, D., et al. (2002). Whole-Genome Comparison of Mycobacterium tuberculosis Clinical and Laboratory Strains. J. Bacteriol. 184, 5479-5490.
35. Miller, L.P., Crawford, J.T., and Shinnick, T.M. (1994). The rpoB gene of Mycobacterium tuberculosis. Antimicrob. Agents Chemother. 38, 805-811
36. Telenti, A., Imboden, P., Marchesi, F., Lowrie, D., Cole, S., Colston, M.J., Matter, L., Schöpfer, K., and Bodmer, T. (1993). Detection of rifampicin resistance mutations in Mycobacterium tuberculosis. Lancet 341, 647-650
37. Kaufhold, A., Podbielski, A., Johnson, D.R., Kaplan, E.L., and Lütticken, R. (1992). M protein gene typing of Streptococcus pyogenes by nonradioactively labeled oligonucleotide probes. J. Clin. Microbiol. 30, 2391-2397
38. Colles, F.M., and Maiden, M.C.J. (2012). Campylobacter sequence typing databases: applications and future prospects. Microbiol. Read. Engl. 158, 2695-2709.
39. http://bioinformatics.net.au/software.velvetoptimiser.shtml
40. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215: 403-410.
41. http://jcm.asm.org/content/52/7/2479
42. http://jcm.asm.org/content/early/2014/04/17/JCM.00262-14.short
43. Hannes Pouseele, Bruno Pot, Setting up a genome-wide sequence typing scheme: how to overcome the pitfalls, in preparation.

Claims:
Claims (18)
[1]
A method for determining the presence or absence of one or more predetermined nucleic acid sequences (referred to as "alleles") in a set of nucleic acid sequences (referred to as "read sequences"), said method comprising:
a) defining a k-mer, which is a nucleic acid sequence of length 'k' (for any natural number k), and a k-mer space, which consists of all permutations of nucleic acids (4^k) of the chosen length 'k', wherein nucleic acid sequences that are each other's reverse complement (that is, sequences that are identical except for their direction, said direction being either a forward (5' -> 3') or a reverse (3' -> 5') direction) are considered equivalent within the k-mer space;
b) for each of the one or more predetermined alleles, determining which k-mers, as defined in step a), are present in said allele, and optionally determining the number of times each k-mer, as defined in step a), occurs in said allele, thereby obtaining a corresponding allele-associated k-mer set;
c) providing a set of read sequences;
d) for each of the one or more predetermined alleles and its corresponding allele-associated k-mer set, determining the number of occurrences of each k-mer (of the allele-associated k-mer set) in said set of read sequences, wherein k-mers that are equivalent in the forward (5' -> 3') and reverse (3' -> 5') direction are defined as equivalent k-mers within the k-mer set, but can still be distinguished from each other if desired; and wherein the occurrence of a k-mer can be disregarded based on appropriate quality scores for individual nucleic acids in the read sequence;
e) filtering the number of occurrences of each k-mer (of the allele-associated k-mer set) determined in step d) by resetting this number of occurrences to 0 if: i) the total number of occurrences (in either the forward (5' -> 3') or the reverse (3' -> 5') direction) is below a predetermined threshold, ii) the total number of occurrences in the forward (5' -> 3') direction is below a predetermined threshold, or iii) the total number of occurrences in the reverse (3' -> 5') direction is below a predetermined threshold;
f) determining the presence or absence of each predetermined allele in the read sequences based on the filtered number of occurrences of each k-mer (of the allele-associated k-mer set) obtained in step e).
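Steps a) to f) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the claimed implementation: all function names are hypothetical, quality scores are ignored, the example sequences are invented, and the decision rule in step f) (an allele is called present only if every one of its k-mers survives the filter) is an assumption, since the claim only requires the decision to be based on the filtered counts.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (equivalence used in step a)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers_of(seq, k):
    """Step b: the set of k-mers occurring in an allele."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def strand_counts(allele_kmers, reads, k):
    """Step d: count each allele k-mer in the reads, separately per direction."""
    forward, reverse = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if word in allele_kmers:
                forward[word] += 1
            rc = reverse_complement(word)
            if rc in allele_kmers:
                reverse[rc] += 1
    return forward, reverse


def filter_counts(allele_kmers, forward, reverse,
                  min_total=3, min_forward=1, min_reverse=1):
    """Step e: reset a k-mer's count to 0 if any of the three thresholds is missed."""
    filtered = {}
    for kmer in allele_kmers:
        fwd, rev = forward[kmer], reverse[kmer]
        total = fwd + rev
        passed = (total >= min_total and fwd >= min_forward
                  and rev >= min_reverse)
        filtered[kmer] = total if passed else 0
    return filtered


def call_alleles(alleles, reads, k, **thresholds):
    """Step f (assumed rule): present if every k-mer survives the filter."""
    calls = {}
    for name, seq in alleles.items():
        kmers = kmers_of(seq, k)
        forward, reverse = strand_counts(kmers, reads, k)
        filtered = filter_counts(kmers, forward, reverse, **thresholds)
        calls[name] = all(count > 0 for count in filtered.values())
    return calls


# Invented toy data: two alleles that differ by a single base.
alleles = {
    "allele_1": "ACGTTGCATGCCGATAGGCTA",
    "allele_2": "ACGTTGCATGACGATAGGCTA",
}
sample = alleles["allele_1"]
reads = [sample, sample, reverse_complement(sample), reverse_complement(sample)]
print(call_alleles(alleles, reads, k=5))  # {'allele_1': True, 'allele_2': False}
```

Because the counts are taken directly from the reads, no assembly step is involved; the k-mers of allele_2 that span the differing base never occur in the reads, so that allele is not called.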
[2]
A method for determining the presence or absence of one or more sets (referred to as "loci") containing one or more predetermined nucleic acid sequences (referred to as "alleles") in a set of nucleic acid sequences (referred to as "read sequences"), said method comprising:
a) defining a k-mer, which is a nucleic acid sequence of length 'k' (for any natural number k), and a k-mer space, which consists of all permutations of nucleic acids (4^k) of the chosen length 'k', wherein nucleic acid sequences that are each other's reverse complement (that is, sequences that are identical except for their direction, said direction being either a forward (5' -> 3') or a reverse (3' -> 5') direction) are considered equivalent within the k-mer space;
b) for each of the one or more predetermined alleles, determining which k-mers, as defined in step a), are present in said allele, and optionally determining the number of times each k-mer, as defined in step a), occurs in said allele, thereby obtaining a corresponding allele-associated k-mer set;
c) for each locus, determining which one or more allele-associated k-mer sets are present in the locus, and optionally determining the number of times each k-mer, as defined in step a), occurs;
d) providing a set of read sequences;
e) for each locus and its associated allele-associated k-mer sets, determining the presence of said one or more allele-associated k-mer sets and the number of times each k-mer (of each of the allele-associated k-mer sets) occurs in said set of read sequences, wherein k-mers that are equivalent in the forward (5' -> 3') and reverse (3' -> 5') direction are defined as equivalent k-mers within the k-mer set, but can still be distinguished from each other if desired; and wherein the occurrence of a k-mer can be disregarded based on appropriate quality scores for individual nucleic acids in the read sequence;
f) filtering the number of occurrences of each k-mer (of each of the allele-associated k-mer sets) determined in step e) by resetting this number of occurrences to 0 if: i) the total number of occurrences (in either the forward (5' -> 3') or the reverse (3' -> 5') direction) is below a predetermined threshold, ii) the total number of occurrences in the forward (5' -> 3') direction is below a predetermined threshold, or iii) the total number of occurrences in the reverse (3' -> 5') direction is below a predetermined threshold;
g) determining the presence or absence of a locus containing one or more predetermined alleles in the read sequences based on the filtered number of occurrences of each k-mer (of each of the allele-associated k-mer sets) obtained in step f).
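The locus-level decision in the last step can be pictured as a grouping of per-allele calls. The sketch below is an assumption about one simple way to do this, not the claimed implementation: `call_loci` is a hypothetical helper, and the rule that a locus is present when at least one of its alleles is called is chosen for illustration. The data mirror the outcome of Example 1 for locus CAMP1442.

```python
def call_loci(loci, allele_present):
    """Map each locus to the list of its alleles that were called present."""
    return {
        locus: [name for name in names if allele_present.get(name, False)]
        for locus, names in loci.items()
    }


# Outcome of Example 1: allele 1 was called, allele 69 was not.
loci = {"CAMP1442": ["allele 1", "allele 69"]}
allele_present = {"allele 1": True, "allele 69": False}

calls = call_loci(loci, allele_present)
locus_present = {locus: bool(found) for locus, found in calls.items()}
print(calls)          # {'CAMP1442': ['allele 1']}
print(locus_present)  # {'CAMP1442': True}
```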
[3]
A method for determining the presence or absence of one or more predetermined amino acid sequences (referred to as "alleles") in a set of amino acid sequences (referred to as "read sequences"), said method comprising:
a) defining a k-mer, which is an amino acid sequence of length 'k' (for any natural number k), and a k-mer space, which consists of all permutations of amino acids (20^k to 22^k) of the chosen length 'k';
b) for each of the one or more predetermined alleles, determining which k-mers, as defined in step a), are present in said allele, and optionally determining the number of times each k-mer, as defined in step a), occurs in said allele, thereby obtaining a corresponding allele-associated k-mer set;
c) providing a set of read sequences;
d) for each of the one or more predetermined alleles and its corresponding allele-associated k-mer set, determining the number of occurrences of each k-mer (of the allele-associated k-mer set) in said set of read sequences, wherein the occurrence of a k-mer can be disregarded based on appropriate quality scores for individual amino acids in the read sequence;
e) filtering the number of occurrences of each k-mer (of the allele-associated k-mer set) determined in step d) by resetting this number of occurrences to 0 if the total number of occurrences is below a predetermined threshold;
f) determining the presence or absence of each predetermined allele in the read sequences based on the filtered number of occurrences of each k-mer (of the allele-associated k-mer set) obtained in step e).
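For amino acid sequences there is no reverse complement, so counting and filtering reduce to a single total per k-mer. The following minimal sketch illustrates this under stated assumptions: the function names are hypothetical, the peptide sequences are invented, and quality scores are ignored.

```python
from collections import Counter


def peptide_kmer_counts(allele, reads, k):
    """Count how often each k-mer of the allele occurs in the reads."""
    wanted = {allele[i:i + k] for i in range(len(allele) - k + 1)}
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if word in wanted:
                counts[word] += 1
    return {kmer: counts[kmer] for kmer in wanted}


def filter_by_total(counts, min_total=3):
    """Reset a k-mer's count to 0 if it falls below the single threshold."""
    return {kmer: (n if n >= min_total else 0) for kmer, n in counts.items()}


# Invented toy data: three reads match the allele, one differs at one residue.
allele = "MKTAYIAKQR"
reads = ["MKTAYIAKQR"] * 3 + ["MKTAYLAKQR"]

counts = peptide_kmer_counts(allele, reads, k=5)
filtered = filter_by_total(counts, min_total=4)
print(filtered["MKTAY"], filtered["KTAYI"])  # 4 0
```

Only the k-mer upstream of the differing residue is seen in all four reads; with a threshold of 4, every other k-mer of the allele is reset to 0.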
[4]
A method for determining the presence or absence of one or more sets (referred to as "loci") containing one or more predetermined amino acid sequences (referred to as "alleles") in a set of amino acid sequences (referred to as "read sequences"), said method comprising:
a) defining a k-mer, which is an amino acid sequence of length 'k' (for any natural number k), and a k-mer space, which consists of all permutations of amino acids (20^k to 22^k) of the chosen length 'k';
b) for each of the one or more predetermined alleles, determining which k-mers, as defined in step a), are present in said allele, and optionally determining the number of times each k-mer, as defined in step a), occurs in said allele, thereby obtaining a corresponding allele-associated k-mer set;
c) for each locus, determining which one or more allele-associated k-mer sets are present in the locus, and optionally determining the number of occurrences of each k-mer, as defined in step a);
d) providing a set of read sequences;
e) for each locus and its associated allele-associated k-mer sets, determining the presence of said one or more allele-associated k-mer sets and the number of times each k-mer (of each of the allele-associated k-mer sets) occurs in said set of read sequences, wherein the occurrence of a k-mer can be disregarded based on appropriate quality scores for individual amino acids in the read sequence;
f) filtering the number of occurrences of each k-mer (of each of the allele-associated k-mer sets) determined in step e) by resetting this number of occurrences to 0 if the total number of occurrences (in either the forward (5' -> 3') or the reverse (3' -> 5') direction) is below a predetermined threshold;
g) determining the presence or absence of a locus containing one or more predetermined alleles in the read sequences based on the filtered number of occurrences of each k-mer (of each of the allele-associated k-mer sets) obtained in step f).
[5]
The method according to any of claims 1 to 4, wherein, in the absence of a predefined allele or of a locus containing one or more predefined alleles, the method further comprises determining the percent sequence identity of one or more of the k-mers as compared to the read sequences.
[6]
The method of any of claims 1 to 5, wherein the read sequences in the set are unassembled, assembled, or partially assembled sequences.
[7]
The method of claim 6, wherein the unassembled sequences are obtained from a sequencing platform selected from Sanger sequencing, pyrosequencing, sequencing by synthesis, and any other sequencing type yielding nucleic acid or amino acid sequences.
[8]
The method of claim 6, wherein the unassembled, assembled, or partially assembled sequences are obtained from any biological material, or from in silico data.
[9]
The method of claim 8, wherein the biological material is selected from one or more organisms and any portion thereof.
[10]
The method of claim 9, wherein the one or more organisms are selected from prokaryotes, including bacteria and archaea, viruses, fungi, microscopic arthropods, microscopic crustaceans, any pathogenic, chimeric, or artificially created microorganisms, and any mixture thereof.
[11]
The method of any one of claims 1 to 2, wherein k is selected from 11 to 71 nucleic acids.
[12]
The method according to any of claims 3 to 4, wherein k is selected from 5 to 23 amino acids.
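The size of the k-mer space at the bounds of these ranges can be worked out directly. The sketch below is illustrative only and uses hypothetical function names; the count of reverse-complement equivalence classes follows from the fact that a nucleic acid k-mer can equal its own reverse complement only when k is even, in which case there are 4^(k/2) such k-mers.

```python
def nucleotide_kmer_space(k):
    """All nucleic acid k-mers: 4^k."""
    return 4 ** k


def canonical_nucleotide_kmer_space(k):
    """Nucleic acid k-mers when reverse complements are considered equivalent."""
    self_complementary = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + self_complementary) // 2


def amino_acid_kmer_space(k, alphabet_size=20):
    """All amino acid k-mers for an alphabet of 20 to 22 residues."""
    return alphabet_size ** k


for k in (11, 71):
    print(k, nucleotide_kmer_space(k), canonical_nucleotide_kmer_space(k))
for k in (5, 23):
    print(k, amino_acid_kmer_space(k), amino_acid_kmer_space(k, alphabet_size=22))
```

For k = 11, for example, there are 4194304 nucleic acid k-mers, or 2097152 once reverse complements are merged; for amino acids with k = 5 and a 20-letter alphabet there are 3200000.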
[13]
The method of any of claims 1 to 12, wherein the method is used for typing or subtyping; multi-locus sequence typing (MLST); extended multi-locus sequence typing (eMLST); ribosomal multi-locus sequence typing (rMLST); core genome sequence typing; whole-genome multi-locus sequence typing (wgMLST, MLST+); spoligotyping for Mycobacterium tuberculosis; detection of large sequence polymorphisms (LSP) for Mycobacterium tuberculosis; Taqman® based SNP analysis; single locus sequence typing (or allele typing); antibiotic resistance typing; antigen serotyping; SPATE typing (serine protease autotransporters of Enterobacteriaceae); prediction of drug resistance in HIV, HBV, or HCV; direct repeat unit (DRU) typing; mycobacterial interspersed repetitive unit (MIRU) typing; variable number of tandem repeats (VNTR) typing; or clustered regularly interspaced short palindromic repeats (CRISPR) typing.
[14]
A system for determining the presence or absence of one or more loci containing one or more predetermined nucleic acid or amino acid sequences or variants thereof, or for determining the presence or absence of one or more predetermined nucleic acid or amino acid sequences or variants thereof, in a set of read sequences, the system comprising at least one processor and an associated storage medium containing a program executable by said at least one processor, said system comprising software code portions performing the steps as defined in any of claims 1 to 4, in any logical order.
[15]
The system of claim 14, comprising one or more of the following: a personal computer, a portable computer, a laptop computer, a netbook computer, a tablet computer, a smartphone, a digital still camera, a video camera, a mobile communication device, a personal digital assistant, a scanner or a multifunction device.
[16]
A non-volatile storage medium that stores a computer program product that includes software code portions in a format executable on a computer device and configured to perform the steps as defined in any of claims 1 to 4, in any logical order, when executed on said computer device.
[17]
A computer program product executable on a computer device and comprising software code for performing the method of any one of claims 1 to 13 when executed on said computer device.
[18]
The non-volatile storage medium of claim 16 or the computer program product of claim 17, wherein the computer device is selected from a personal computer, a portable computer, a laptop computer, a netbook computer, a tablet computer, a smartphone, a digital still camera, a video camera, a mobile communication device, a personal digital assistant, a scanner and a multifunction device.

Patent family:

Publication number | Publication date

BE1024766A1|2018-06-21|

EP3051450A1|2016-08-03|

WO2016124600A1|2016-08-11|

Cited documents:

Publication number | Filing date | Publication date | Applicant | Title

US20030211504A1|2001-10-09|2003-11-13|Kim Fechtel|Methods for identifying nucleic acid polymorphisms|

WO2003087412A2|2002-04-10|2003-10-23|Applera Corporation|Mutation detection and identification|

WO2004099443A2|2003-05-08|2004-11-18|Febit Ag|Method for selection of optimal microarray probes|

US20130345066A1|2012-05-09|2013-12-26|Life Technologies Corporation|Systems and methods for identifying sequence variation|

US10395759B2|2015-05-18|2019-08-27|Regeneron Pharmaceuticals, Inc.|Methods and systems for copy number variant detection|

WO2018080477A1|2016-10-26|2018-05-03|The Joan & Irwin Jacobs Technion-Cornell Institute|Systems and methods for ultra-fast identification and abundance estimates of microorganisms using a kmer-depth based approach and privacy-preserving protocols|

CN109722485A|2018-11-20|2019-05-07|上海派森诺生物科技股份有限公司|A method of Rapid identification Human Fungi is sequenced based on sanger|

Legal status:
2018-08-29| FG| Patent granted|Effective date: 20180625 |

Priority:

Application number | Filing date | Title

EP15153406.2|2015-02-02|

EP15153406.2A|EP3051450A1|2015-02-02|2015-02-02|Method of typing nucleic acid or amino acid sequences based on sequence analysis|
